Cosmos: Fix 410/1002 PartitionKeyRangeGone surfacing on query/change-feed paths during split/merge #49436

Open
NaluTripician wants to merge 2 commits into main from nalutripician/fix-410-1002-pkrange-gone-query-retry

NaluTripician (Contributor) commented Jun 10, 2026

Description

The PartitionKeyRangeGoneRetryPolicy previously:

  • retried 410/1002 (PartitionKeyRangeGone) only once (volatile boolean retried), and
  • ignored 410/1007 (CompletingSplitOrMerge) and 410/1008 (CompletingPartitionMigration) — they fell through to the next policy.

As a result, during a slow / in-progress partition split or merge, callers could see a transient 410 surfaced as a hard error. The bulk (BulkOperationRetryPolicy) and transactional-batch (TransactionalBatchRetryPolicy) policies already handle all three sub-statuses with a bounded retry budget — this policy was the outlier.

Fix

  • Replaced the one-shot boolean retried with a bounded AtomicInteger retryCount (MAX_RETRY_COUNT = 10), matching the AtomicInteger pattern already used by GoneAndRetryWithRetryPolicy.
  • Broadened the matched 410 sub-statuses to PARTITION_KEY_RANGE_GONE + COMPLETING_SPLIT_OR_MERGE + COMPLETING_PARTITION_MIGRATION. Each matching attempt force-refreshes the routing map and then returns retryAfter(Duration.ZERO), i.e. an immediate retry.
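
For reviewers who want the broadened match in isolation, here is a minimal plain-Java sketch of the status/sub-status check. `GoneSubStatusMatcher` and `isRetriableGone` are hypothetical names used only for illustration, not the SDK's classes; the numeric values are the ones quoted in this description.

```java
import java.util.Set;

// Hypothetical stand-in, not the SDK class: models only the 410 sub-status match.
final class GoneSubStatusMatcher {
    static final int GONE = 410;
    // Sub-status values as listed in the PR description.
    static final int PARTITION_KEY_RANGE_GONE = 1002;
    static final int COMPLETING_SPLIT_OR_MERGE = 1007;
    static final int COMPLETING_PARTITION_MIGRATION = 1008;

    private static final Set<Integer> RETRIABLE_SUB_STATUSES = Set.of(
        PARTITION_KEY_RANGE_GONE,
        COMPLETING_SPLIT_OR_MERGE,
        COMPLETING_PARTITION_MIGRATION);

    private GoneSubStatusMatcher() {
    }

    /** True when the response is a 410 carrying one of the three topology-transition sub-statuses. */
    static boolean isRetriableGone(int statusCode, int subStatusCode) {
        return statusCode == GONE && RETRIABLE_SUB_STATUSES.contains(subStatusCode);
    }

    public static void main(String[] args) {
        System.out.println(isRetriableGone(410, 1007)); // matched: retried by this policy
        System.out.println(isRetriableGone(410, 1000)); // not matched: falls through to the next policy
    }
}
```

Anything that does not match is expected to fall through to the next policy, as before.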

This mirrors the .NET SDK fix in Azure/azure-cosmos-dotnet-v3 #5941.

Scope note for reviewers

PartitionKeyRangeGoneRetryPolicy is constructed on two paths: the query path (DefaultDocumentQueryExecutionContext) and the change-feed reader path (ChangeFeedFetcher). This change therefore affects change-feed split/merge handling as well — please consciously sign off on that. The changelog wording has been broadened to "query and change-feed paths" to reflect this.

Self-review (no double-handling)

A deep self-review confirmed that the query/change-feed policy and the transport-layer GoneAndRetryWithRetryPolicy (point reads/writes via ReplicatedResourceClient) operate on mutually exclusive code paths, so adding 1007/1008 here does not double-handle. Retry-count semantics verified: getAndIncrement() returns the value before the increment, so it yields exactly 10 retries (prior values 0–9) and the error surfaces on the 11th matching 410. Delegation to nextRetryPolicy for non-matching exceptions is preserved; the reactive Mono flow is intact.
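
The boundary arithmetic can be checked with a small stand-alone sketch. It assumes the guard has the shape `retryCount.getAndIncrement() < MAX_RETRY_COUNT`; `RetryBudget`, `tryConsume`, and `grantedBeforeRefusal` are hypothetical names, not code from the diff.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the counter logic only; the guard shape is an assumption.
final class RetryBudget {
    static final int MAX_RETRY_COUNT = 10; // value taken from the PR description
    private final AtomicInteger retryCount = new AtomicInteger(0);

    /** getAndIncrement() returns the value before the increment, so prior values 0..9 are granted. */
    boolean tryConsume() {
        return retryCount.getAndIncrement() < MAX_RETRY_COUNT;
    }

    /** Counts how many consecutive attempts are granted before the first refusal. */
    static int grantedBeforeRefusal() {
        RetryBudget budget = new RetryBudget();
        int granted = 0;
        while (budget.tryConsume()) {
            granted++;
        }
        return granted;
    }

    public static void main(String[] args) {
        System.out.println(grantedBeforeRefusal()); // prints 10
    }
}
```

Note that the counter keeps incrementing after the budget is exhausted, which is harmless for a per-request policy instance.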

CI

All build/test jobs pass. The single red leaf job — Test Emulator windows2022_Spark35Scala213IntegrationTests…Java17ChangeFeedPartitionReaderITest "should honor endLSN during split and should hang" — is a timing flake, not caused by this change: the same test passed in the sibling Spark 3.5 / Scala 2.12 job in the same build (the Java SDK class under review is byte-identical across those two jobs). The assertion (future.isCompleted shouldEqual true after a fixed sleep + poll) is inherently racy under emulator ingestion timing.

⚠️ Pre-merge requirement (does not block marking the PR ready for review)

This change alters behavior of a previously-untested class (it still carried // TODO: this need testing). Unit tests should be added before merge. I could not run the Maven build locally, so the existing coverage relies on CI. Recommended StepVerifier cases:

  1. 410/1002, 410/1007, 410/1008 each → retryAfter(ZERO) + routing-map force-refresh invoked.
  2. 11 consecutive matching 410s → first 10 retry, 11th yields ShouldRetryResult.error(...) (locks the 10-retry boundary).
  3. Non-matching exception → delegates to nextRetryPolicy.shouldRetry exactly once.
  4. Null routing map from tryLookupAsync → no NPE, returns retry.
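
As a starting point, the four cases can be prototyped against a synchronous stand-in before they are written as StepVerifier tests. This sketch does not use Reactor or any SDK type: `SketchPolicy`, `SketchDecision`, `NextPolicy`, and the `Supplier`-based routing-map refresh are all hypothetical, and the null-map handling shown is an assumption about the desired behavior, not a statement about the current implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical, synchronous stand-ins; the real policy is reactive (Mono-based).
enum SketchDecision { RETRY_NOW, ERROR, DELEGATED }

interface NextPolicy {
    SketchDecision shouldRetry(int status, int subStatus);
}

final class SketchPolicy {
    private static final int MAX_RETRY_COUNT = 10;
    private final AtomicInteger retryCount = new AtomicInteger(0);
    private final Supplier<Object> refreshRoutingMap; // may return null (case 4)
    private final NextPolicy next;

    SketchPolicy(Supplier<Object> refreshRoutingMap, NextPolicy next) {
        this.refreshRoutingMap = refreshRoutingMap;
        this.next = next;
    }

    SketchDecision shouldRetry(int status, int subStatus) {
        boolean matches = status == 410
            && (subStatus == 1002 || subStatus == 1007 || subStatus == 1008);
        if (!matches) {
            return next.shouldRetry(status, subStatus); // case 3: delegate
        }
        if (retryCount.getAndIncrement() >= MAX_RETRY_COUNT) {
            return SketchDecision.ERROR; // case 2: budget exhausted
        }
        refreshRoutingMap.get(); // result deliberately unused, so a null map cannot cause an NPE
        return SketchDecision.RETRY_NOW;
    }
}

final class SketchPolicyChecks {
    static void check(boolean ok, String message) {
        if (!ok) {
            throw new AssertionError(message);
        }
    }

    static void run() {
        AtomicInteger refreshes = new AtomicInteger();
        AtomicInteger delegations = new AtomicInteger();
        NextPolicy next = (status, subStatus) -> {
            delegations.incrementAndGet();
            return SketchDecision.DELEGATED;
        };

        // Cases 1 and 4: each sub-status retries immediately and refreshes, even with a null map.
        for (int sub : new int[] {1002, 1007, 1008}) {
            SketchPolicy policy = new SketchPolicy(() -> {
                refreshes.incrementAndGet();
                return null;
            }, next);
            check(policy.shouldRetry(410, sub) == SketchDecision.RETRY_NOW, "expected retry for " + sub);
        }
        check(refreshes.get() == 3, "expected one refresh per matching attempt");

        // Case 2: first 10 matching 410s retry, the 11th yields an error.
        SketchPolicy bounded = new SketchPolicy(() -> null, next);
        for (int i = 0; i < 10; i++) {
            check(bounded.shouldRetry(410, 1002) == SketchDecision.RETRY_NOW, "attempt " + i);
        }
        check(bounded.shouldRetry(410, 1002) == SketchDecision.ERROR, "11th should surface");

        // Case 3: a non-matching status is delegated exactly once.
        SketchPolicy delegating = new SketchPolicy(() -> null, next);
        check(delegating.shouldRetry(404, 0) == SketchDecision.DELEGATED, "expected delegation");
        check(delegations.get() == 1, "expected exactly one delegation");
    }

    public static void main(String[] args) {
        run();
        System.out.println("all sketch checks passed");
    }
}
```

The real tests would assert the same outcomes on the emitted ShouldRetryResult rather than on a synchronous return value.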

🤖 Generated via the Seon workflow (cross-SDK port of the .NET 410/1002 fix; mandatory self-review + CI adjudication applied).

…ring split/merge

The query-path PartitionKeyRangeGoneRetryPolicy retried 410/1002 only once and ignored 410/1007 (CompletingSplitOrMerge) and 410/1008 (CompletingPartitionMigration), surfacing transient 410s to query callers during a partition split/merge. It now refreshes the routing map and retries those sub-statuses up to 10 times (using an AtomicInteger counter), matching the bulk/transactional-batch retry policies. Mirrors the .NET SDK fix (Azure/azure-cosmos-dotnet-v3 PR #5941).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed paths

Self-review noted PartitionKeyRangeGoneRetryPolicy is also used by the change-feed reader path (ChangeFeedFetcher), not only queries. Clarify the changelog scope accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician changed the title from "Cosmos: Fix 410/1002 PartitionKeyRangeGone surfacing on query path during split/merge" to "Cosmos: Fix 410/1002 PartitionKeyRangeGone surfacing on query/change-feed paths during split/merge" on Jun 10, 2026
NaluTripician marked this pull request as ready for review on June 10, 2026 20:26
Copilot AI review requested due to automatic review settings on June 10, 2026 20:26
NaluTripician requested review from a team and kirankumarkolli as code owners on June 10, 2026 20:26

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the Cosmos DB Java SDK’s internal PartitionKeyRangeGoneRetryPolicy to better handle transient 410 (Gone) responses during partition topology transitions (split/merge/migration) on query and change-feed request paths, aligning behavior with other retry policies in the SDK.

Changes:

  • Replaced one-shot retry (volatile boolean retried) with a bounded retry budget using AtomicInteger (max 10 retries).
  • Expanded the handled 410 sub-status codes from only 1002 to also include 1007 and 1008, forcing routing-map refresh and retrying immediately (Duration.ZERO).
  • Updated CHANGELOG.md to describe the corrected behavior for query and change-feed paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/PartitionKeyRangeGoneRetryPolicy.java` | Adds bounded retry budget and broadens 410 sub-status handling with per-attempt routing-map refresh. |
| `sdk/cosmos/azure-cosmos/CHANGELOG.md` | Documents the fix and the expanded 410 sub-status retry behavior. |

Comment on lines +103 to +104

    return refreshedRoutingMapObs.flatMap(rm ->
        Mono.just(ShouldRetryResult.retryAfter(Duration.ZERO)));

Comment on lines +29 to +30

    private static final int MAX_RETRY_COUNT = 10;
    private final AtomicInteger retryCount = new AtomicInteger(0);

#### Breaking Changes

#### Bugs Fixed
* Fixed transient `410/1002` (`PartitionKeyRangeGone`) errors surfacing to callers during a partition split or merge. The `PartitionKeyRangeGoneRetryPolicy` (used on the query and change-feed paths) previously retried only once and ignored the in-progress `410/1007` (`CompletingSplitOrMerge`) and `410/1008` (`CompletingPartitionMigration`) sub-status codes; it now refreshes the routing map and retries those sub-statuses up to 10 times before surfacing the error.